Mass Spectrometry Data: An Introduction
Context
Some brief context first. What I'm going to share is based on my experience during my PhD. I did not get any particular training in mass spectrometry or proteomics. While I am fairly confident in what I'm about to show you, I would appreciate any notes, corrections, addenda, or caveats be directed to mail@ on this site. My background is entirely from the computational side, no biology here. Therefore, this introduction to mass spectrometry data analysis will follow a computational angle: less discussion of what it means, biologically speaking, and more about what to watch out for, how to evaluate the data, how to process it, and so forth.
Mass spectrometry can be used in many other ways which will not be covered here: I assume a peptide identification workflow using tandem mass spectrometry in shotgun proteomics (i.e. we never go all the way to proteins, in part because the discussion of error control at the level of proteins is even more complicated than at the peptide level; and we don't assume quantitative proteomics either). To keep things simple, I'll also assume all the data uses higher energy collisional dissociation (HCD) fragmentation at a fixed normalized collision energy (NCE).
Wetlab
- Cells (or other protein-bearing specimen) are collected from some source
- Proteins are extracted from the source
- (Optional) Proteins are separated based on some feature, like mass range
- Proteins are digested by some process, such as trypsin digestion, into peptides
- (Optional) Peptides are separated based on some feature, like hydrophobicity or (again) mass range
- Peptides are ionized and injected into the mass spectrometer
- The mass spectrometer directs the charged particles to a detector, analyzes the raw readings, and outputs a mass spectrum
- The data is analyzed by identification tools, etc.
What's in a Spectrum?
After the wetlab part of a proteomics workflow, we're typically left with proprietary vendor-specific files, such as Thermo Fisher .RAW or AB SCIEX .WIFF. A lot of people publish those proprietary files instead of doing conversion first, so the first step is to get the data into shape. The main options are mzXML (an XML-like format, which I wouldn't recommend), or MGF (the Mascot generic format, sometimes also called Mascot general format, although that's incorrect).
The MGF format is not standardized, but still routinely used. It's rather easy to work with, easy to read and even easy to edit by hand. It looks like this:
```
BEGIN IONS
CHARGE=3+
SEQ=PEPTIDE
SCAN=1435
SCANS=1435
TITLE="mgf_library_17/first_pool/f1000.raw"
123.123 99999.123123
234.234 8980.1727
...
1024.554 324.4
END IONS
```
Features of interest include:
- BEGIN IONS and END IONS. I have never seen an MGF file fail to place or write them in exactly the same way.
- Field order and field types: neither are defined, and they vary a lot. One source may have the SEQ= field, another will not. One may call their SEQ= field SEQUENCE= instead, and so forth.
- Redundant fields: SCAN= and SCANS=, for instance.
- Inconsistent values in fields: CHARGE=3+ may be spelled CHARGE=3 (no +) in another source.
- Inconsistent separators: some MGF files use spaces to separate m/z and intensity values, others use tabs. Some files will also have blank lines all over, while others will have none.
Basic QA
Basic Cleanup
Empty or near-empty spectra can fairly safely be discarded in most cases; it won't be possible to get much out of them unless, perhaps, the proteome under study is very sparse. It's also common to remove spectra with very few or very many peaks (through an arbitrary threshold, like ≤ 20 or ≥ 500), as they are likely to be pure noise.
The resulting spectra are also often intensity-culled: the intensity values are divided by the maximum intensity in each spectrum, and peaks whose intensity does not exceed, for example, 1% or 5% of that maximum (i.e., values of 0.01 or 0.05 after the division) are removed. Sometimes, a transform such as log or square root is applied either before or after the cull.
Another common processing step is to correct the predicted charge provided by the instrument. This is done by locating peaks which are 1/z Da apart for some charge level z, corresponding to the expected spacing of the carbon isotope peaks under the hypothesis that the peptide is of charge level z. Depending on the dataset, 1-10% of the spectra may have been assigned an incorrect charge. This can also be observed when high-intensity peaks sit well below or above the m/z reported in the file.
A related transformation is the removal of isotope peaks, called "deisotoping". However, in my experience, deisotoping has a high false positive rate: the result is reduced identification rates on cleaner datasets (though it does tend to help on dirtier data).
When sequence identities are available, a dataset can further be quality-controlled by removing spectra whose precursor exact mass is too far from the theoretical mass of the sequence. This can be seen as a more refined version of charge correction (since the exact mass is usually wrong on account of an incorrect charge in the spectrum file).
In deep learning, a common transformation is to "rasterize" the spectrum into a fixed-length vector (for example, a vector where each element represents a further 0.1 Da increment, so the intensities of peaks 1193.12 and 1193.14 would be summed together in vector element 11931, 0-based). This amounts to inverting the peak-picking transform (but not recovering any of the data lost in the peak-picking process)... Would such algorithms perform well on profile-mode spectra? If only there were a way to find that out!
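To make the cull concrete, here is a minimal sketch (assuming NumPy arrays of m/z and intensity values; the function name and the 1% default are just for this example):

```python
import numpy as np

def cull_peaks(mz, intensity, min_frac=0.01):
    """Divide intensities by the spectrum's base peak, then drop any
    peak whose relative intensity does not exceed min_frac."""
    mz = np.asarray(mz, dtype=float)
    rel = np.asarray(intensity, dtype=float)
    rel = rel / rel.max()
    keep = rel > min_frac
    return mz[keep], rel[keep]
```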
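The charge-correction idea can be sketched the same way: score each candidate charge z by how much intensity sits at the expected isotope spacings near the precursor, and pick the winner. This is a toy version of the approach under the same array assumptions (1.00335 Da is the C13-C12 mass difference):

```python
import numpy as np

def detect_charge(mz, intensity, precursor_mz, max_z=4, tol=0.01, n_isotopes=3):
    """Guess the precursor charge from isotope peaks spaced ~1.00335/z apart."""
    mz = np.asarray(mz, dtype=float)
    intensity = np.asarray(intensity, dtype=float)
    best_z, best_score = 1, 0.0
    for z in range(1, max_z + 1):
        score = 0.0
        for k in range(1, n_isotopes + 1):
            target = precursor_mz + k * 1.00335 / z
            score += intensity[np.abs(mz - target) < tol].sum()
        if score > best_score:
            best_z, best_score = z, score
    return best_z
```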
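Finally, a sketch of the rasterization step, with the 0.1 Da bin width from the example above (the maximum m/z is an arbitrary choice):

```python
import numpy as np

def rasterize(mz, intensity, bin_width=0.1, max_mz=2000.0):
    """Sum peak intensities into fixed-width m/z bins (0-based indexing)."""
    vec = np.zeros(int(round(max_mz / bin_width)))
    idx = (np.asarray(mz, dtype=float) / bin_width).astype(int)
    in_range = idx < len(vec)
    np.add.at(vec, idx[in_range], np.asarray(intensity, dtype=float)[in_range])
    return vec
```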
Generating a Spectrum
We are interested in peptide identification, and the state of the art is not much more than in-silico digestion of candidate proteins, filtering to the spectrum's precursor mass (within tolerance), and some kind of correlation metric.
Figure 4 shows a typical peptide (the R# are "residues" attached to the "peptide backbone" -- an amino acid's name identifies this so-called residue in particular). The physical model of fragmentation typically considered admits fragmentation events at some specific bonds of the molecule, as depicted in the image. Those bonds are labeled based on their location and order. With HCD, the main fragments are expected to be of type b or y. The ions generated by fragmentation following such a pattern form a series, so here we are interested in the "b-ion series" or the "y-ion series".
More complex models also consider so-called neutral losses, the loss of uncharged molecules from the fragment. However, their consideration is still not all that common, and in my experience, they don't tend to be any more likely to be present in the spectrum (on average) than just the b- and y-ions (refer to Figure 1), so the added noise does not seem that useful to me.
On the flipside, we now have amazing deep learning models like PredFull, which generate very accurate predictions for the entire spectrum (not just those series and/or neutral losses). A major caveat is that this kind of model does not seem to generalize too well to other species yet. But I digress.
To generate a simple b- and y-ion series theoretical spectrum, the first step is to acquire a protein database. A common resource to find one is the UniProt database. The next step is to do in-silico digestion. If we're using the typical trypsin, the regular expression rule (as per the Expasy PeptideCutter tool) is
([KR](?=[^P]))|((?<=W)K(?=P))|((?<=M)R(?=P))
, which means, roughly, "Cut after lysine or arginine, but not if followed by proline, except if it's a lysine preceded by a tryptophan or an arginine preceded by a methionine (in which case, do cut despite the proline)". Applying this rule gives us all potential digestion sites, but real life does not work like that, and trypsin will not fully digest everything, for a variety of reasons. To generate our spectrum, we have to consider at least missed cleavages, that is, the sites that don't get digested. It's just a matter of sticking consecutive pieces together after a full theoretical digestion. It's typical to consider up to 2 missed cleavages, which means we have to add to our peptide pool not just the fully digested products, but also all pairs and all triplets of digested products that are consecutive along the protein sequence (a digestion sketch follows below).
Not covered are more complications, like semi-tryptic digestion or post-translational modifications (PTMs), such as oxidation, or fixed modifications like carbamidomethyl cysteine (CAM), or more complicated models including neutral losses and other events.
Now, all we have to do is compute the fragment masses resulting from a candidate peptide sequence at a given charge level (remember: we need m/z, not just m). This is simple: we just sum the individual amino acid residue masses, the mass of the two peptide termini, and any mass changes due to post-translational modifications, and then we deal with the charge. Given that, to be of charge z, a peptide must have z-1 extra protons compared to the charge 1 molecule, we have to remember to account for that mass before dividing by the charge to obtain our m/z. Mind the mass difference between the resulting calculations for the y- and b-series (or, if using them, other series like the a, z, c or x). Peptide search engines, various utilities, and online resources are all available to compare results against, such as Protein Prospector's MS-Product.
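Here is a minimal sketch of the digestion step, applying the regular expression above and gluing together up to two missed cleavages (the function and its exact interface are mine, for illustration only):

```python
import re

TRYPSIN = re.compile(r"([KR](?=[^P]))|((?<=W)K(?=P))|((?<=M)R(?=P))")

def digest(protein, missed_cleavages=2):
    """Fully digest `protein`, then also emit all runs of up to
    `missed_cleavages` + 1 consecutive pieces."""
    # A cut happens just after each matched residue.
    cuts = [0] + [m.end() for m in TRYPSIN.finditer(protein)] + [len(protein)]
    pieces = [protein[a:b] for a, b in zip(cuts, cuts[1:]) if a < b]
    peptides = set()
    for i in range(len(pieces)):
        for j in range(i, min(i + missed_cleavages + 1, len(pieces))):
            peptides.add("".join(pieces[i:j + 1]))
    return peptides
```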
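And a sketch of the m/z computation for the b- and y-ion series. The monoisotopic masses below are standard values, but do verify them (and the outputs) against a tool like MS-Product before relying on them; fixed modifications like CAM would be added to the relevant residue mass and are not handled here:

```python
# Monoisotopic amino acid residue masses, in Da.
RESIDUE = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.007276  # added once per charge
WATER = 18.010565  # the two termini: H on the N side, OH on the C side

def by_series(peptide, z=1):
    """m/z values of the b- and y-ion series at fragment charge z.
    b fragments carry no terminus mass; y fragments keep the water."""
    masses = [RESIDUE[aa] for aa in peptide]
    total = sum(masses) + WATER
    b, y = [], []
    prefix = 0.0
    for m in masses[:-1]:  # a fragment never spans the whole peptide
        prefix += m
        b.append((prefix + z * PROTON) / z)
        y.append((total - prefix + z * PROTON) / z)
    return b, y  # y is ordered from longest fragment to shortest
```

For the SEQ=PEPTIDE spectrum from the MGF example earlier, by_series("PEPTIDE") would give the singly charged b- and y-ion m/z values to match against the peak list.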
Conclusion
Now that you have your data and your theoretical spectra, peptide identification is simply about selecting the subset of peptide sequences in your database that are relevant to the current spectrum, based on mass and charge, then computing a similarity score. Designing the right score is an active area of research. Current search engines are not designed with computational research in mind, which means that any improvement has to fit the box set up by another engine, or must be implemented in a novel search engine. I believe that this is a major hurdle to modern proteomics as a whole, which is why a significant portion of my PhD was spent developing a research-first platform (which ended up running faster than common engines like X!Tandem and, if electing to eschew the speed, can generate 20% more identifications at the same controlled error rate than Comet in early tests).
The last part of the story (although there are some interludes, such as rescoring, if one is so inclined) is error control. The datasets I link to in the beginning of this post have ground truth peptide sequences available, which makes evaluating search engine improvements straightforward in this context. But how can we get confident identifications from our scoring function on a novel sample set, when we don't know what the actual peptides in the sample are? The common approach today is to search not just the digested proteins, but also the digested inverted proteins (there are other approaches, but they are generally less robust in terms of error control). Peptides generated this way that already occur in the real candidate database are removed, and the remaining "decoy" peptides are supposed to be similar enough to the real ones that the search engine will assign false hits uniformly at random between the real database and the decoy database. Thus, we can simply use the formula (decoy hits) / (true hits + false hits), under the assumption that decoy hits = false hits (making it equivalent to (false hits) / (true hits + false hits)), to estimate the false discovery rate (FDR). This approach is called the "target-decoy approach" (TDA). Unfortunately, this method is not robust, and frequently overestimates the true false discovery rate by factors as high as 5 to 10. Moreover, it prevents comparing identification results between databases, because the distribution of their hits differs and they over- or underestimate false discovery completely differently.
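To make the estimate concrete, here is a minimal sketch of the TDA formula (assuming one best hit per spectrum, each a (score, is_decoy) pair; this data layout is mine, for illustration):

```python
def tda_fdr(hits, threshold):
    """Estimate FDR among target hits scoring at least `threshold`,
    using decoy hits as a stand-in for false target hits."""
    targets = sum(1 for score, is_decoy in hits if not is_decoy and score >= threshold)
    decoys = sum(1 for score, is_decoy in hits if is_decoy and score >= threshold)
    return decoys / targets if targets else 0.0
```

In practice, one sweeps the threshold to find the largest accepted set at, say, 1% estimated FDR.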
Unfortunately, not only is this method widespread, but modern research efforts are also widely directed at breaking the required "decoy hits = false hits" assumption in various ways. I instead propose that, at the very least, any newly proposed identification approach must be evaluated in terms of the true false discovery rate (sometimes referred to as "factual FDR" or, more commonly, as the false discovery proportion (FDP)).
Perhaps rescoring or TDA-FDR will be the topic of another post. In any case, you are now armed with enough knowledge to be dangerous with mass spectrometry data. Watch your step, good luck, and happy hacking!
See my followup post for a worked example of a peptide identification workflow.